IEEE Journal of Biomedical and Health Informatics
● Institute of Electrical and Electronics Engineers (IEEE)
Preprints posted in the last 30 days, ranked by how well they match the content profile of IEEE Journal of Biomedical and Health Informatics, based on 14 papers previously published here. The average preprint has a 0.10% match score for this journal, so anything above that is already an above-average fit.
Sheriff, A.
Over 54 million Americans are aged 65+, with depression affecting 25-49% and anxiety exceeding 30% of assisted living residents. AI systems employing agentic orchestration exhibit 0.5-2% failure rates--unacceptable where a single missed crisis can be fatal. We designed and bench-evaluated Lilo Engine, a 5-layer deterministic therapeutic pipeline replacing a prior multi-agent orchestrator. Safety is enforced through structural invariants: a Guardian layer with 4-gate OR crisis detection runs unconditionally on every input; a Reflector layer validates every output. Evaluated across 3,720 test scenarios, the system achieved 100% crisis recall (500/500 comprehensive scenarios), <5% false positive rate, and 28.7 ms detection latency--well within crisis response benchmarks. Intent classification reached 96.4% accuracy; generation quality 98.4%. The architecture reduced execution paths from 7+ to exactly 2, producing deterministic, HIPAA-auditable traces. Clinical validation with elderly populations is the essential next step.
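The Guardian layer's "4-gate OR" crisis detection can be sketched as follows. The gate definitions below are purely illustrative stand-ins, not the paper's actual detectors; the point is the structural invariant that an input is flagged if ANY gate fires, so one weak gate cannot suppress escalation.

```python
from typing import Callable, List

# Hypothetical gates -- names and rules invented for illustration only.
def keyword_gate(text: str) -> bool:
    return any(k in text.lower() for k in ("suicide", "end my life", "hurt myself"))

def negation_aware_gate(text: str) -> bool:
    t = text.lower()
    return "hopeless" in t and "not hopeless" not in t

def escalation_gate(text: str) -> bool:
    # Sustained shouting as a crude distress proxy.
    return text.isupper() and len(text) > 20

def pattern_gate(text: str) -> bool:
    return "goodbye forever" in text.lower()

GATES: List[Callable[[str], bool]] = [
    keyword_gate, negation_aware_gate, escalation_gate, pattern_gate,
]

def crisis_detect(text: str) -> bool:
    """OR-composition: flag if ANY gate fires; runs unconditionally on every input."""
    return any(gate(text) for gate in GATES)
```

The OR structure trades a higher false-positive rate for recall, which matches the abstract's reported 100% crisis recall at a <5% false-positive rate.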
Wu, Y. C.; Yin, M.; Shi, B.; Zhang, Z.; Yin, D.; Wang, X.; Wang, Y.; Fan, J.; Jin, R.; Wang, H.; Ying, K.; Pang, K.; Rojansky, R.; Curtis, C.; Bao, Z.; Wang, M.; Cong, L.
Medicine historically separates abstract clinical reasoning from physical intervention. We bridge this divide with MedOS, a general-purpose embodied world model. Mimicking human cognition via a dual-system architecture, MedOS demonstrates superior reasoning on biomedical benchmarks and autonomously executes complex clinical research. To extend this intelligence physically, the system simulates medical procedures as a physics-aware model to foresee adverse events. Generated and validated on the MedSuperVision benchmark, MedOS exhibits spatial intelligence for reasoning and action. Crucially, we demonstrate that this platform democratizes clinical expertise and narrows the performance gap between junior and senior physicians. MedOS transforms clinical intervention into a collaborative discipline where human intuition and machine intelligence co-evolve.
Wieczorek, J.; Jiang, X.; Palade, V.; Trela, J.
Data scarcity and stylistic heterogeneity pose major challenges for emotion intensity classification. This paper presents a cross-dataset augmentation framework that leverages prompt-conditioned generative models alongside deterministic and heuristic transformations to synthesize target-style examples for improved transfer learning. We introduce a unified taxonomy of augmentation strategies--Heuristic Lexical Perturbation (HLA), Prompt-Conditioned Generative Augmentation (CGA), Sequential Hybrid Pipeline (SHA), Rule-Guided Style Adaptation (DSGA), and Enhanced Hybrid Augmentation (EHA)--and detail an interpretability-oriented prompt engineering approach that conditions LLMs on authentic target exemplars and stylistic features extracted from the target dataset. Augmented datasets were evaluated using multi-dimensional quality metrics (transformation quality, stylistic consistency, BLEU/CHRF, Self-BLEU, uniqueness) and downstream classification via a two-phase BERT-LSTM training with rigorous statistical testing. During source dataset pretraining and subsequent target dataset fine-tuning, CGA achieved the highest single-method gains in F1 and accuracy (F1 = 0.8816; accuracy = 0.8819, 95% CI recalculated). HLA and SHA exhibited improved cross-domain stability, suggesting stronger domain-generalizable features. We observe systematic trade-offs between fluency, lexical diversity, and emotion fidelity: high surface similarity often correlates with classifier performance but does not fully capture affective authenticity. We discuss methodological pitfalls, propose best practices for emotion-aware augmentation, and provide reproducible artifacts (prompts, example transformations, evaluation scripts) to facilitate further research in affective NLP.
Ray, P.
Thyroid carcinoma is one of the most prevalent endocrine malignancies worldwide, and accurate preoperative differentiation between benign and malignant thyroid nodules remains clinically challenging. Diagnostic methods that medical practitioners use at present depend on their personal judgment to evaluate both imaging results and separate clinical tests, which creates inconsistency that leads to incorrect medical evaluations. The combination of radiological imaging with clinical information systems enables healthcare providers to enhance their capacity to make reliable predictions about patient outcomes while improving their decision-making abilities. The study introduces a deep learning framework that utilizes multiple data sources by combining magnetic resonance imaging (MRI) data with clinical text to predict thyroid cancer. The system uses a Vision Transformer (ViT) to obtain advanced MRI scan features, while a domain-adapted language model processes clinical documents that contain patient medical history and symptoms and laboratory results. The cross-modal attention system enables the system to merge imaging data with textual information from different sources, which helps to identify how the two types of data are interconnected. The system uses a classification layer to classify the fused features, which allows it to determine the probability of cancerous tumors. The experimental results show that the proposed multimodal system achieves better results than the unimodal base systems because it has higher accuracy, sensitivity, specificity, and AUC values, which help medical personnel to make better preoperative decisions.
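The cross-modal attention described above is, at its core, scaled dot-product attention with imaging features as queries and clinical-text features as keys/values. A minimal single-head sketch, with shapes and normalization assumed rather than taken from the paper:

```python
import numpy as np

def cross_modal_attention(img_feats: np.ndarray, txt_feats: np.ndarray) -> np.ndarray:
    """Single-head cross-attention: imaging tokens (queries) attend over
    clinical-text tokens (keys/values).

    img_feats: (n_img, d) ViT patch embeddings (illustrative).
    txt_feats: (n_txt, d) language-model token embeddings (illustrative).
    Returns:   (n_img, d) text context vectors aligned to each image token.
    """
    d_k = img_feats.shape[-1]
    scores = img_feats @ txt_feats.T / np.sqrt(d_k)        # (n_img, n_txt)
    # Numerically stable row-wise softmax over text tokens.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ txt_feats                             # (n_img, d)
```

In the full model this fused representation would feed the classification layer; here only the attention step is shown.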
Sharma, K.; Sivadas, H.; Reddy, S.
Emergency Department triage is a critical decision-making process in which clinicians must rapidly assess patient acuity under high cognitive load and time pressure. We present ED-Triage-Agent (ETA), a multi-agent AI framework designed to augment clinical decision-making in Emergency Severity Index (ESI) classification through human-AI collaboration. The system operates in two phases: (1) autonomous patient intake via a conversational agent that collects structured symptom histories and (2) collaborative acuity assessment in which specialized agents prioritize patients for vital sign collection and generate ESI classifications with explicit clinical reasoning. Unlike monolithic AI prediction systems, ETA mirrors clinical workflow by supporting decisions at each triage stage while preserving clinician autonomy. We describe the system architecture, agent design principles, and a preliminary evaluation methodology using the ESI Implementation Handbook case studies (60 standardized cases). This work proposes a model for deploying multi-agent AI systems in time-critical clinical environments where explainability and human oversight are essential. Code and the evaluation framework are available at https://github.com/Karthick47v2/ED-Triage-Agent.
Yousaf, M. N.; Anwar, M. N.; Naveed, N.; Haider, U.
Background: Tinnitus affects a substantial proportion of the global population and can severely disrupt sleep, mood, and daily functioning, yet the quality of mobile health apps designed for tinnitus management remains highly variable. Traditional evaluation methods, including clinical trials, expert rating scales, and small-scale surveys, rarely capture large-scale, feature-level feedback from real-world users, leaving a gap in understanding which app characteristics drive sustained engagement and satisfaction. Methods: This study analysed 342,520 English-language reviews from 84 tinnitus-related apps on iOS and Android collected between 2015 and 2025. A pipeline first applied VADER-based preprocessing and sentiment assignment, then trained a graph neural network aspect-based sentiment analysis (GNN-ABSA) model operating on sentence-level dependency graphs to infer feature-level sentiment for domains such as sound therapy, sleep support, pricing, advertisements, stability, and user interface. Results: The GNN-ABSA model achieved an accuracy of 84.4% and a macro F1 score of 0.829 on unseen aspect-level test data, indicating stable performance across sentiment classes. Therapeutic features such as sound masking and sleep support were associated with predominantly positive sentiment, whereas pricing, advertisements, background playback, and technical stability attracted more neutral or negative feedback over the ten-year period. Conclusions: Large-scale, graph-based feature-level sentiment analysis provides a user-centred perspective that complements clinical trials and expert app quality ratings, offering actionable guidance for developers seeking to prioritize design improvements, supporting clinicians in recommending suitable apps to patients, and informing the design of more explainable and user-driven digital health tools. Trial Registration: Not applicable. This study analysed publicly available app store reviews and did not involve human participants.
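The VADER sentiment-assignment step can be illustrated with the standard compound-score thresholds (positive at ≥ 0.05, negative at ≤ -0.05). The per-aspect aggregation below is a simplified stand-in for the GNN-ABSA model's output, using invented aspect names:

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

def vader_label(compound: float) -> str:
    """Map a VADER compound score to a class using the conventional thresholds."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

def aspect_sentiment(scored: Iterable[Tuple[str, float]]) -> Dict[str, str]:
    """Majority sentiment per aspect from (aspect, compound_score) pairs.
    A toy aggregation -- the paper's GNN operates on dependency graphs."""
    tally: Dict[str, Counter] = defaultdict(Counter)
    for aspect, compound in scored:
        tally[aspect][vader_label(compound)] += 1
    return {aspect: counts.most_common(1)[0][0] for aspect, counts in tally.items()}
```

In practice the compound scores would come from `SentimentIntensityAnalyzer.polarity_scores`; here they are supplied directly to keep the sketch self-contained.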
Deng, L.; Chen, L.; Liu, M.
Large language models (LLMs) perform strongly across a wide range of medical applications, yet it remains unclear whether such success reflects genuine understanding of medical concepts. We present an ontology-grounded, concept-centered evaluation of medical concept understanding in LLMs. Using 6,252 phenotype concepts from Human Phenotype Ontology, we decompose concept understanding into three core dimensions--concept identity, concept hierarchy, and concept meaning--and design corresponding benchmarks for each dimension. Across a representative set of contemporary LLMs, best-performing models achieve high accuracy on concept identity (90.6%) and hierarchy (83.8%), but lower performance on concept meaning (72.6%). Concept-level analysis reveals substantial fragmentation in LLM understanding: only 57.7% of concepts are consistently understood across all three dimensions, while 41.3% show partial understanding and 1.1% are not captured in any dimension. These results demonstrate that strong application-level performance of LLMs can mask fundamental gaps in concept-level understanding, highlighting the necessity for ontology-grounded evaluation in medical AI.
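Two of the three dimensions above (identity and hierarchy) reduce to simple ontology operations, sketched here on a hypothetical mini-ontology rather than the paper's HPO-derived benchmark items:

```python
# Toy mini-ontology -- content invented for illustration, not actual
# Human Phenotype Ontology structure or the paper's test items.
PARENT = {
    "Focal seizure": "Seizure",
    "Seizure": "Abnormal nervous system physiology",
}
SYNONYMS = {
    "HP:0001250": {"Seizure", "Seizures", "Epileptic seizure"},  # ID illustrative
}

def same_concept(a: str, b: str) -> bool:
    """Concept identity: two labels denote one concept if they share a synonym set."""
    return any(a in syns and b in syns for syns in SYNONYMS.values())

def is_ancestor(child: str, ancestor: str) -> bool:
    """Concept hierarchy: transitive is-a subsumption via parent links."""
    while child in PARENT:
        child = PARENT[child]
        if child == ancestor:
            return True
    return False
```

The third dimension, concept meaning, has no such mechanical check, which is consistent with it being the dimension where the benchmarked models score lowest (72.6%).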
Lagunas, A.; Chen, P.-J.; Bruns, T. M.; Gupta, P.
Objective: This study aimed to characterize the activation of lower urinary tract (LUT) targets in response to pudendal nerve stimulation (PNS) in awake human participants. Materials and Methods: In this single-center study, recruited participants had an implanted pudendal neurostimulator for treatment of their symptoms including overactive bladder, incontinence, urinary retention, and/or pelvic pain. Participants came in for a modified urodynamic study where a multichannel manometry catheter was placed in the lower urinary tract alongside a dual sensor urodynamics catheter. The bladder was filled and after each participant expressed a strong desire to void, PNS was applied and LUT pressures were measured. Participants attempted voids with the catheters in place to characterize LUT behavior and voiding efficiency with and without stimulation. Results: The study consisted of 15 participants including 13 women. Across 133 total trials, contractions were observed at the distal urethra 52 times (39%) and at the proximal urethra 46 times (35%). The maximum observed pressure change occurred significantly more often at the proximal urethra than the distal urethra (p = 0.007). There was a significantly higher maximum tolerable stimulation amplitude for low frequency stimulation (2-3.1 Hz) when compared to high frequency stimulation (30-33 Hz) (p = 0.041). In one participant there were four instances of stimulation-driven bladder contractions with an average pressure change of 24.3 cmH2O (standard deviation = 10.5). There was not a significant difference in voiding efficiency or maximum flow rate with and without stimulation (p = 0.76 and p = 0.45, respectively). Conclusions: PNS can affect LUT pressures at tolerable stimulation amplitudes. The absence of an effect of PNS on voiding characteristics suggests a similar mechanism of action as sacral neuromodulation.
He, K.; Fang, Y.; Frank, E.; Li, C.; Bohnert, A.; Sen, S.; Wang, M.
Health behaviors such as physical activity and sleep affect mental health, but the effect of each health behavior varies substantially across individuals, limiting the usefulness of generic behavioral recommendations. We collected one year of continuous wearable and ecological momentary assessment data from 3,139 participants in the Intern Health Study (2018-2023), and examined individual-level associations between wearable-derived features and mood across the internship year. The behaviors associated with mood were highly heterogeneous between individuals: the two most prevalent drivers of mood were wake-up time (the strongest driver for 34.0% of subjects) and step count (10.6% of subjects). The correlation directionality remained largely stable despite fluctuations in strength. Interestingly, 20.3% of subjects showed no significant correlations. These findings highlight the limitations of population-level recommendations and the critical need for personalized, data-driven approaches to mental health assessment and intervention. To translate these personalized insights into actionable support, we developed MoodDriver, a large language model (LLM)-powered system that generates tailored feedback emails based on each participant's behavioral and physiological patterns. This work demonstrates the feasibility of combining digital phenotyping with large language models to advance precision digital mental health for high-risk populations.
Guo, Y.; Zhou, Y.; Hu, D.; Sutari, S.; Chow, E.; Tam, S.; Perret, D.; Pandita, D.; Zheng, K.
Ambient AI documentation tools generate draft notes that clinicians can review and edit before signing off in electronic health records. Scalable computational approaches to characterize how clinicians modify drafts remain limited, yet are essential for evaluating and improving AI effectiveness. We examined the feasibility of a few-shot prompted large language model (LLM) for categorizing sentence-level edits between AI drafts and final documentation. We developed five label-specific binary models targeting medication, symptom, diagnosis, orders/tests/procedures, and social history edits, and refined prompts using adversarial negatives and verification gates. Evaluation was performed against a human-annotated corpus. Medication and symptom models achieved promising performance (F1=0.787 and 0.780), whereas the remaining models were precision-limited. Errors clustered in long, complex edits and category-boundary ambiguity. Prompt engineering is therefore reliable for categorizing edits with explicit clues, while for complex, context-dependent categories these models are better suited to triage, labeling edits for human review.
ALI, H.; Woitek, R.; Trattnig, S.; Zaric, O.
Sodium (23Na) magnetic resonance imaging (MRI) provides valuable metabolic information, but it is limited by a low signal-to-noise ratio (SNR) and long acquisition times. To overcome these challenges, we present a Deep Image Prior (DIP)-based framework that combines anatomically guided proton (1H) MRI and metabolically guided 23Na MRI denoising via a fused proton-sodium prior within a directional total variation (dTV) regularization scheme. The DIP-Fusion approach minimizes a variational loss function combining data fidelity, fused dTV regularization, gradient consistency, and bias-field correction to reconstruct sodium images. MRI data were acquired from healthy volunteers and breast cancer patients. Healthy datasets were retrospectively undersampled at multiple factors, and fully sampled scans served as the ground truth. Patient datasets acquired for clinical purposes were reconstructed using the baseline DIP and the proposed DIP-Fusion methods. Sodium images were reconstructed using sum-of-squares (SoS) and adaptive combined (ADC) coil combination methods. We evaluated reconstruction performance using quantitative image quality metrics, including peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), mean squared error (MSE), learned perceptual image patch similarity (LPIPS), feature similarity index (FSIM), and Laplacian focus. In healthy volunteers, DIP-Fusion outperformed state-of-the-art reconstruction techniques across all undersampling factors. In patient datasets, DIP-Fusion demonstrated superior performance compared with baseline DIP, achieving improved structural fidelity and sodium-specific signal preservation. These results demonstrate the potential for robust, high-quality sodium MRI reconstruction under accelerated acquisition, which could lead to reduced scan times and enhanced clinical feasibility.
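The directional total variation (dTV) term penalizes sodium-image gradients that are not aligned with edges of the proton reference. A simplified sketch using forward differences and isotropic epsilon smoothing; the paper's exact fused weighting may differ:

```python
import numpy as np

def grad2d(img: np.ndarray):
    """Forward-difference gradients with replicated boundary."""
    gx = np.diff(img, axis=1, append=img[:, -1:])
    gy = np.diff(img, axis=0, append=img[-1:, :])
    return gx, gy

def dtv(u: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
    """Directional TV: penalize the component of the sodium image's gradient
    orthogonal to the proton reference's edge direction, so edges shared with
    the reference are cheap while unaligned structure is penalized."""
    ux, uy = grad2d(u)
    rx, ry = grad2d(ref)
    norm = np.sqrt(rx**2 + ry**2 + eps)
    dx, dy = rx / norm, ry / norm            # unit edge direction of the reference
    dot = ux * dx + uy * dy                  # gradient component along ref edges
    px, py = ux - dot * dx, uy - dot * dy    # orthogonal (penalized) residual
    return float(np.sqrt(px**2 + py**2 + eps).sum())
```

Edges of the sodium image that coincide with proton-image edges incur almost no penalty, while edges absent from the reference are charged their full magnitude.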
Francis, A. J. A.; Raza, A.; Patel, N.; Gajbhiye, R.; Kumar, V.; T, A.; Saikia, A.; Mibang, O.; K, V.; Joshi, K.; Tony, L.; Balasubramani, P. P.
The rapid growth of tele-counseling and the use of lay counselors in high-volume, low-resource mental health services has created a need for scalable tools for early detection and triage. Effective personalization now requires stratifying individuals by dominant symptom profiles, such as appetite, agency, anxiety, and sleep disturbances. Depression symptoms vary widely, even among those with similar scores, reflecting distinct psychophysiological and cognitive-affective patterns. In tele-mental-health settings, where contextual cues are limited, multimodal behavioral signals from natural interactions can complement traditional assessments. Using synchronized audio, video, and text data from the EDAIC dataset (N=275), we propose a multimodal learning framework to classify five clinically validated outcomes: Depression, Appetite disturbance, Agency impairment, Anxiety, and Sleep problems. We developed a comprehensive multimodal machine-learning pipeline, incorporating automated dataset construction, modality-specific feature extraction (acoustic, facial action unit, linguistic), and supervised learning with cross-validation. Labels were derived from validated scoring rules to ensure clinical relevance. Sentiment analysis revealed lower sentiment scores in participants with high Depression, Anxiety, or Agency scores, but no significant differences in Appetite or Sleep severity. Model performance was assessed across three scenarios: text (transcripts), phone calls (audio + transcript), and video calls (audio + video + transcript). Temporal models (CNN+BiLSTM) achieved over 65% accuracy across modalities, while a fine-tuned temporal model for depression detection using video calls reached an accuracy of 81% with an f1-score of 0.79, demonstrating that our approach performs on par with state-of-the-art methods. XGBoost excelled in phone and video calls, while Ridge classifiers performed best for text-based inputs. 
SHAP (Shapley value) analysis identified key audio and video features for detecting Depression and other symptoms. A translational avatar-based interface validated system operability, demonstrating the potential for scalable, objective mental-health assessment in tele-counseling.
Ekram, T. T.
Background: Large language models (LLMs) are increasingly deployed in medical contexts as patient-facing assistants, providing medication information, symptom triage, and health guidance. Understanding their robustness to adversarial inputs is critical for patient safety, as even a single safety failure can lead to adverse outcomes including severe harm or death. Objective: To systematically evaluate the safety guardrails of state-of-the-art LLMs through adversarial red-teaming specifically designed for medical contexts. Methods: We developed a comprehensive taxonomy of 8 adversarial attack categories targeting medical AI safety, encompassing 24 distinct sub-strategies. Using an LLM-based attack generator, we created 160 realistic adversarial prompts across categories including dangerous dosing, contraindication bypass, emergency misdirection, and multi-turn escalation. We tested multiple leading LLMs (Claude Sonnet 4.5, GPT-5.2, Gemini 2.5 Pro, Gemini 3 Flash) using both single-turn and multi-turn attack sequences. All models received identical, standard medical assistant system prompts. An automated evaluator (Claude Sonnet 4.5) pre-screened responses for harm potential (0-5 scale) and guardrail effectiveness, with physician review planned for high-risk responses (harm_level ≥ 3). Results: Of 160 adversarial prompts evaluated against Claude Sonnet 4.5, 11 (6.9%) elicited responses meeting our threshold for clinically significant harm (harm level ≥ 3 on a 0-5 scale). The model exhibited full refusal behavior in 86.2% of cases. Authority Impersonation was the dominant attack vector (45.0% success rate), with the "Educational Authority" sub-strategy (framing requests as medical student questions) achieving 83.3% success -- the highest of any sub-strategy. Multi-turn escalation attacks achieved 0% success (0/20). Six of eight attack categories yielded no successful attacks. Physician review of the 11 flagged high-harm cases is in progress.
Conclusions: Standard medical assistant system prompts provide strong baseline protection against most adversarial attacks, but are substantially vulnerable to authority impersonation -- particularly claims of educational context. The primary failure mode is behavioral mode-switching: the model provides clinically accurate but inadequately safety-framed responses when it perceives a professional audience, rather than providing factually incorrect information. This suggests that guardrail improvements should target context-conditioned behavior rather than factual accuracy alone. Our open-source taxonomy and evaluation pipeline enable ongoing adversarial assessment as medical AI systems evolve. Impact: This work provides the first systematic taxonomy and evaluation framework for medical AI adversarial testing, enabling developers to identify and remediate safety gaps before deployment. Our open-source attack taxonomy and methodology can serve as a foundation for ongoing red-teaming efforts as medical AI systems continue to evolve.
Balakrishna, K.; Hammond, A.; Cheruku, S.; Das, A.; Saggu, M.; Thakur, N. A.; Urrea, R.; Zhu, H.
Coronary Artery Disease (CAD) is a leading cause of cardiovascular-related mortality, affecting 20.5 million people in the United States and approximately 315 million people worldwide in 2022. The asymptomatic and progressive nature of CAD presents challenges for early diagnosis and timely intervention. Traditional diagnostic methods such as angiography and stress tests are known to be resource-intensive and prone to human error, creating a need for automated and time-effective detection methods. This paper introduces a novel approach to the diagnosis of CAD based on a Convolutional Neural Network (CNN) with a temporal attention mechanism. The architecture automatically extracts and emphasizes critical features from sequential medical imaging data from coronary angiograms, allowing subtle signs of CAD to be spotted that conventional methods might miss. The temporal attention mechanism strengthens the model's ability to focus on relevant temporal patterns, improving sensitivity and robustness in detecting CAD across various stages of the disease. Experimental validation on a large and diverse dataset demonstrates the efficacy of the proposed method, with significant improvements in both detection accuracy and processing time compared to traditional CNN architectures. The results of this study propose a scalable system for the diagnosis of CAD that can be integrated into clinical workflows to assist healthcare professionals. Ultimately, this research contributes to the field of AI-driven healthcare solutions and has the potential to reduce the global burden of CAD through early automated detection.
Specht, B.; Garbaya, S.; Ermis, O.; Schneider, R.; Chavarriaga, R.; Khadraoui, D.; Tayeb, Z.
Cross-device medical federated learning--where individual patients participate directly rather than institutions--poses a unique challenge: each client holds only a few samples, often just one (e.g., a single diagnostic record), leaving insufficient local data for gradient computation. Existing approaches, such as Secure Aggregation, require client-to-client coordination impractical for intermittently available mobile devices, while homomorphic encryption introduces substantial computational overhead. We present privateboost, a federated XGBoost system that addresses this setting through m-of-n Shamir secret sharing with commitment-based anonymous aggregation. Clients distribute shares to a fixed set of shareholders--requiring no client-to-client communication--and the aggregator reconstructs only aggregate gradient sums via Lagrange interpolation, never observing individual values or client identities. We evaluate on UCI medical datasets, demonstrating 98% split gain retention relative to centralized XGBoost and accuracy resilient to up to 80% client dropout.
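The core privacy mechanism -- Shamir sharing of client gradients, with the aggregator reconstructing only the aggregate sum via Lagrange interpolation -- can be sketched as follows. Field size and protocol details are illustrative, not privateboost's actual implementation:

```python
import random

P = 2**61 - 1  # prime field modulus (illustrative size)

def share(secret: int, m: int, n: int):
    """Split `secret` into n Shamir shares with reconstruction threshold m."""
    coeffs = [secret] + [random.randrange(P) for _ in range(m - 1)]
    def poly(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):       # Horner evaluation mod P
            acc = (acc * x + c) % P
        return acc
    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares) -> int:
    """Lagrange interpolation at x=0 recovers the secret from >= m shares."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % P
                den = (den * (xi - xj)) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return total

def aggregate(client_secrets, m: int, n: int) -> int:
    """Shareholders sum shares pointwise; the aggregator reconstructs only the
    SUM of client values -- never any individual client's value."""
    all_shares = [share(s, m, n) for s in client_secrets]
    summed = [(x, sum(sh[k][1] for sh in all_shares) % P)
              for k, x in enumerate(range(1, n + 1))]
    return reconstruct(summed[:m])
```

Because Shamir sharing is linear, the pointwise sum of shares is itself a valid sharing of the sum of secrets, which is what makes this aggregation work without client-to-client communication.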
Auger, S. D.; Graham, N. S. N.; Scott, G.
Background: Hosting large language models (LLMs) on-premises can secure patient data but requires compact architectures to function on standard hardware. The impact of such constraints on the robustness of their representations for medical terminology is important for clinical AI safety but poorly understood. The statistical nature of LLM training inherently limits the representation of terms with low societal prominence or lexical frequency, and high ambiguity. Methods: We assessed 15 open-weights LLMs (4B-120B) for their representational robustness of 250 neurological terms. Neurology was chosen for its strict hierarchical and anatomical terminology. A term's representation was deemed robust only if the model correctly navigated four tests, verifying valid links against distractors and reverse associations. We examined associations between representational robustness and model size, medical fine-tuning, and five terminological subdomains (localisation, clinical features, investigations, diagnoses, and treatments). We assessed term difficulty using the semantic complexity index (SCI), a novel composite integrating societal prominence, lexical frequency, and ambiguity. Results: Representational robustness followed a log-linear scaling law relative to model size (r=0.736, p=0.002). Medical fine-tuning yielded no benefit for 4B models, but significantly improved larger 27B model performance, with the rate of robust representations rising from 38.2% to 62.6% (p<0.0001). While most local LLMs' performance degraded sharply with increasing SCI values, GPT-OSS 20B and 120B maintained complexity invariance (with <20% decline from lowest- to highest-complexity terms). Notably, the general-purpose 20B GPT-OSS model outperformed larger and medically fine-tuned counterparts. Robustness varied by subdomain (F=4.69, p=0.003), with diagnoses (73.8%) scoring significantly higher than localisation (47.9%, p=0.004) and clinical features (52.1%, p=0.02).
Conclusions: While representational robustness broadly follows model-size scaling laws, neither model size nor fine-tuning guarantees clinical reliability. Since performance fluctuates with terminological complexity and subdomain, safe deployment requires validating representational robustness for specific use cases rather than assuming larger models handle medical language safely. 1-2 Sentence Description: This study shows that model size and medical fine-tuning are not reliable indicators of clinical robustness across 15 locally-deployable LLMs. Because performance varies significantly by terminological complexity and subdomain, safe application requires validation methods that account for these factors.
Pham, T. D.
Objective: This study investigates whether incorporating physiological coupling concepts into neural network design can support stable and interpretable feature learning for histopathological image classification under limited data conditions. Methods: A physiologically inspired architecture, termed CardioPulmoNet, is introduced to model interacting feature streams analogous to pulmonary ventilation and cardiac perfusion. Local and global tissue features are integrated through bidirectional multi-head attention, while a homeostatic regularization term encourages balanced information exchange between streams. The model was evaluated on three histopathological datasets involving oral squamous cell carcinoma, oral submucous fibrosis, and heart failure. In addition to end-to-end training, learned representations were assessed using linear support vector machines to examine feature separability. Results: CardioPulmoNet achieved performance comparable to several pretrained convolutional neural networks across the evaluated datasets. When combined with a linear classifier, improved classification performance and higher area under the receiver operating characteristic curve were observed, suggesting that the learned feature embeddings are well structured for downstream discrimination. Conclusion: These results indicate that physiologically motivated architectural constraints may contribute to stable and discriminative representation learning in computational pathology, particularly when training data are limited. The proposed framework provides a step toward integrating physiological modeling principles into medical image analysis and may support future development of transferable and interpretable learning systems for histopathological diagnosis.
Alkeyeva, R.; Nagiyev, I.; Kim, D.; Nurmanova, B.; Omarova, Z.; Varol, H. A.; Chan, M.-Y.
Background: The growing interest in applying artificial intelligence to personalized nutrition is challenged by the complex nature of dietary advice, which must balance health, economic, and personal factors. Though automated solutions using either Linear Programming (LP) or Large Language Models (LLMs) already exist, they have significant drawbacks: LP often lacks personalization, whereas LLMs can be unreliable for precise calculations. Objectives: To develop and assess a model that integrates a Mixed Integer Linear Programming (MILP) solver with an LLM to generate personalized meal plans and compare it with standalone LLM and MILP models. Methods: The proposed hybrid MILP+LLM model first uses an LLM (GPT-4o) to filter a unified food dataset (n=297), which combines regional Central Asian and global food items, according to the user's profile. The filtered list of food items is then passed to a MILP solver, which identifies the set of top 10 optimal solutions. Finally, given this set of solutions, the LLM chooses the most appropriate meal plan. The model was evaluated using five synthesized, clinically complex patient profiles sourced from Adilmetova et al. [4]. The performance of this hybrid model was compared against standalone MILP and LLM using a 5-point Likert scale with Kruskal-Wallis and post hoc Dunn's tests for Nutrient Accuracy, Personalization, Practicality, and Variety. Results: Findings demonstrated that the proposed MILP+LLM model reached balanced performance, achieving scores of more than 3.6 points in all criteria, with high scores in Nutrient Accuracy (3.96), Personalization (3.81), and Practicality (3.99). The standalone LLM model performed the weakest in all criteria, with statistically significantly lower scores compared to the other two methods. The standalone MILP model performed best in Nutrient Accuracy (4.93) and in Variety (4.10) but lagged behind the MILP+LLM model in Practicality and Personalization.
Kruskal-Wallis and Dunn's tests showed MILP and MILP+LLM outperformed LLM across all criteria. MILP was more accurate (p<0.0001), while the MILP+LLM model was more practical (p=0.021). Conclusions: The findings suggest that integrating the LLM with the MILP solver creates a model that combines qualitative personalization with quantitative precision. This model produces comprehensive, reliable meal plans, addressing the limitations of using either model alone.
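The MILP stage amounts to selecting a 0/1 subset of foods that satisfies nutrient constraints at minimum cost. The sketch below solves a toy instance by exhaustive search over the same feasible set a MILP solver would optimize; the food table and constraint set are invented for illustration:

```python
from itertools import combinations

# Toy food table: (name, calories, protein_g, cost) -- values illustrative.
FOODS = [
    ("plov", 650, 25, 3.0),
    ("salad", 150, 4, 1.5),
    ("kefir", 120, 8, 1.0),
    ("lentil soup", 300, 18, 2.0),
    ("bread", 250, 9, 0.5),
]

def best_plan(cal_lo: int, cal_hi: int, protein_min: int):
    """Minimize cost subject to calorie-range and minimum-protein constraints,
    by brute force over all 0/1 food selections (the MILP feasible set)."""
    best = None
    for r in range(1, len(FOODS) + 1):
        for combo in combinations(FOODS, r):
            cal = sum(f[1] for f in combo)
            pro = sum(f[2] for f in combo)
            cost = sum(f[3] for f in combo)
            if cal_lo <= cal <= cal_hi and pro >= protein_min:
                if best is None or cost < best[0]:
                    best = (cost, [f[0] for f in combo])
    return best
```

A real solver (e.g. branch-and-bound MILP) finds the same optimum without enumerating all 2^n subsets, and the paper's hybrid additionally keeps the top 10 solutions for the LLM to choose among.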
Ajadi, N. A.; Afolabi, S. O.; Adenekan, I. O.; Jimoh, A. O.; Ajayi, A. O.; Adeniran, T. A.; Adepoju, G. D.; Hassan, N. F.; Ajadi, S. A.
This research presents multimodal deep learning for structural heart disease prediction. We evaluated multiple deep learning architectures, including TCN, a simple CNN, ResNet1d18, a light transformer, and a hybrid model. The models were examined across three random seeds to ensure robustness, and bootstrap confidence intervals were used to measure performance differences. TCN consistently outperformed the competing architectures, achieving statistically significant improvements with stable performance across runs. TCN also offered efficient computation and stable training compared to all competing architectures. Our results highlight the importance of rigorous, fair evaluation when developing deep learning models for healthcare applications.
Islam, N.; Luo, C.; Tong, J.; Polleya, D. A.; Jordan, C. T.; Haverkos, B.; Bair, S.; Kent, A.; Weller, G.
Cox proportional hazards regressions are frequently employed to develop prognostic models for time-to-event data, considering both patient-specific and disease-specific characteristics. In high-dimensional clinical modeling, these biological features can exhibit high collinearity due to inter-feature relationships, potentially causing instability and numerical issues during estimation without regularization. For rare diseases such as acute myeloid leukemia (AML), the sparsity and scarcity of data further complicate estimation. In such cases, data augmentation through multi-site collaboration can alleviate these problems. However, this often necessitates sharing individual patient data (IPD) across sites, which presents challenges due to regulatory barriers aimed at protecting patient privacy. To overcome these challenges, we propose a privacy-preserving algorithm that eliminates sharing IPD across sites and fits a federated penalized piecewise exponential model (FedPPEM) to estimate potential effects of clinical features using summary statistics. This algorithm yields results nearly identical to those from pooled IPD, including effect size and standard error estimates. We demonstrate the model's performance in quantifying effects of clinical features and genetic risk classification on overall survival using real-world data from ~1,200 newly diagnosed AML patients across 33 U.S. sites. Although applied in the AML context, this model is disease-agnostic and can be implemented in other diseases and clinical contexts.
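A piecewise exponential model needs only per-interval event counts and person-time from each site, which is why summary statistics suffice for federation. A minimal sketch of the no-covariate case (interval handling simplified; not the actual FedPPEM code, which also supports penalized covariate effects):

```python
import math

def site_summary(times, events, cuts):
    """Per-interval (event count, person-time) for one site.
    Only these summaries -- never individual records -- leave the site."""
    K = len(cuts) + 1
    d = [0] * K
    expo = [0.0] * K
    bounds = [0.0] + list(cuts) + [math.inf]
    for t, e in zip(times, events):
        for j in range(K):
            lo, hi = bounds[j], bounds[j + 1]
            if t <= lo:
                break
            expo[j] += min(t, hi) - lo          # time at risk in interval j
            if lo < t <= hi and e:
                d[j] += 1                        # event fell in interval j
    return d, expo

def federated_hazard(summaries):
    """Pool summaries across sites; the MLE of the piecewise-constant hazard
    is pooled events divided by pooled person-time, per interval."""
    K = len(summaries[0][0])
    D = [sum(s[0][j] for s in summaries) for j in range(K)]
    E = [sum(s[1][j] for s in summaries) for j in range(K)]
    return [D[j] / E[j] if E[j] > 0 else float("nan") for j in range(K)]
```

Because the pooled likelihood factorizes over intervals, the estimate from summed summaries is exactly the estimate that pooled individual patient data would give, mirroring the near-identical results the abstract reports.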